A Study of Style Effects on OCR Errors in the MEDLINE Database
Abstract
The National Library of Medicine has developed a system that automatically extracts data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process performs well overall, it makes character recognition errors that must be corrected for the process to achieve the requisite accuracy. The correction process feeds each word containing a character with less than 100% confidence (as determined automatically by the OCR engine) to a human operator, who must then verify the word or correct the error manually. Most of these errors occur in the affiliation zone, where the characters are set in italics or small fonts; therefore, only affiliation data are used in this research. This paper examines the correlation between OCR errors and character attributes in the MEDLINE database, such as font size, italics, and bold, as well as OCR confidence levels. The motivation is that, if character style correlates with the types of errors made, this information could be used to improve operator productivity by increasing the probability that the correct word option is presented to the human editor. We have determined that this correlation exists, particularly for characters with diacritics.
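The review rule described in the abstract (any word containing a character below 100% confidence goes to a human operator) can be illustrated with a minimal Python sketch. The `OcrChar` record, its field names, and the 0.0 to 1.0 confidence scale are assumptions made for illustration only; the abstract does not describe the system's actual data structures or output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrChar:
    """Hypothetical per-character OCR result (not the actual system's format)."""
    char: str
    confidence: float      # assumed scale: 1.0 means 100% confidence
    italic: bool = False   # example style attribute


def word_text(word: List[OcrChar]) -> str:
    """Join the recognized characters of one word into a string."""
    return "".join(c.char for c in word)


def words_needing_review(words: List[List[OcrChar]]) -> List[List[OcrChar]]:
    """Return the words that contain at least one character below full confidence."""
    return [w for w in words if any(c.confidence < 1.0 for c in w)]
```

Under these assumptions, a word is flagged as soon as a single character falls below full confidence, which mirrors the rule stated in the abstract; style attributes such as `italic` are carried along so that they could later be compared against the errors the operator corrects.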